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Most methods for next-generation sequencing (NGS] data analyses incorporate information regarding allele frequencies 
using the assumption of Hardy-Weinberg equilibrium (HWE) as a prior. However, many organisms including those that 
are domesticated, partially selfing, or with asexual life cycles show strong deviations from HWE. For such species, and 
specially for low-coverage data, it is necessary to obtain estimates of inbreeding coefficients (F) for each individual before 
calling genotypes. Here, we present two methods for estimating inbreeding coefficients from NGS data based on an 
expectation-maximization (EM) algorithm. We assess the impact of taking inbreeding into account when calling genotypes 
or estimating the site frequency spectrum (SFS), and demonstrate a marked increase in accuracy on low-coverage highly 
inbred samples. We demonstrate the applicability and efficacy of these methods in both simulated and real data sets. 



[Supplemental material is available for this article.] 

Next-generation sequencing (NGS) methods provide fast, cheap, 
and reliable large-scale DNA sequencing data. They are used in de 
novo sequencing, disease mapping, gene expression, and in pop- 
ulation genetic studies, providing rapid and complete sequencing 
of candidate genes, exomes, transcriptomes, or even whole genomes 
(Nagalakshmi et al. 2008; Liti et al. 2009; The 1000 Genomes 
Project Consortium 2010; Li et al. 2010; Ng et al. 2010). Current 
NGS technologies produce short read sequences that are de novo 
assembled or mapped (aligned) to a reference genome and used for 
SNP or genotype calling. However, these data typically have high 
error rates due to multiple factors, from random sampling of ho- 
mologous base pairs in heterozygotes, to sequencing or alignment 
errors. Furthermore, many NGS studies rely on low-coverage se- 
quence data (<5 x per site per individual), causing SNP and genotype 
calling to be associated with considerable statistical uncertainty. 

Recent methods rely on probabilistic frameworks to account 
for these errors and accurately call SNPs and genotypes, even at low 
coverage (Martin et al. 2010; Li 2011; Nielsen et al. 2012). These 
methods integrate the base quality score together with other error 
sources (e.g., mapping or sequencing errors) to calculate an overall 
"genotype likelihood." More specifically, the likelihood at each 
locus / and individual i is defined as 

L Gil =p(X u \G u ), G a e{0,l,2} (1) 

where X u is the observed sequencing data and G n the number of 
minor alleles in individual i at site /. Here and throughout the rest 
of this paper, we assume that a minor allele can be defined. There 
is no loss of generality in this because any arbitrary definition of 
major and minor allele can be used and switching the labeling 
of alleles does not affect the inference framework discussed in 
this paper. 



The genotype likelihood can be calculated in several different 
ways (Li et al. 2009a, b; DePristo et al. 2011), usually by taking se- 
quencing quality of the reads into account. Genotypes can then 
be called based on their likelihoods by selecting the one with the 
highest likelihood. Some studies use more stringent criteria and 
only call a genotype if the highest likelihood genotype is sub- 
stantially more likely than the second one (common threshold is 
10 times more likely); otherwise, the genotype is considered missing 
data (Kim et al. 2011). 

To further improve genotype calling, the likelihood function 
can be combined with a prior, p(G n ), to calculate the genotype 
posterior probability, p(Gu\Xij). In this case, the genotype with the 
highest posterior probability is generally chosen, and this proba- 
bility (or the ratio between the highest and the second highest 
probabilities) is used as a measure of confidence. This way it is 
possible to improve genotype calling, develop associated measures 
of statistical uncertainty, and provide a natural framework for in- 
corporating prior information (Li et al. 2008; The 1000 Genomes 
Project Consortium 2010). Various types of information can be 
used as priors, including information from SNP databases, a refer- 
ence genome, patterns of linkage disequilibrium (LD) and, most 
importantly, information regarding allele frequencies from a larger 
sample or from a reference panel (for review, see Nielsen et al. 
2011). Incorporation of allele frequencies is usually based on the 
assumption of Hardy-Weinberg equilibrium (HWE). However, 
HWE assumes random mating, and, while this assumption might 
approximately hold for most species (e.g., humans), it is clearly 
violated in others like self-pollinating plants and domesticated 
species (due to inbreeding and clonal propagation), as well as 
species with asexual life cycles. This violation can result in the 
undercalling of homozygous genotypes and biases in downstream 
analyses, as we show below, but there are extensions to the HWE 
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that account for these deviations, namely, the inclusion of an in- 
breeding coefficient (F) defined, for a di-allelic locus with alleles 
A and a, as 

f Aa = 2{l-f)f{l-F) 

faa = f 2 + a-f)fF (2) 

where f pq is the frequency of genotype pq and fits minor allele (a) 
frequency (MAF). If the genotypes are known, the log-likelihood 
function (for a single locus and n individuals) for the parameters 
F and fis given by 

log[p(G|f , F)] = nAA log[(l - ff + (1 - f)ff\ 
+ n Aa \og[2(l-f)f(l-F)] 
+ n aa \og[f 2 + (l-f)fF] (3) 

where n^, n Aa , and n aa are the observed counts of genotypes AA,Aa, 
and aa, respectively (n = Haa + n Aa + n aa ), and G is a vector of ob- 
served genotypes from which Uaa, n Aa , and n aa can be calculated. A 
joint maximum likelihood (ML) estimate of F and fis obtained as 

- _ n Aa + 2n aa 

' ~ 2n ( ' 



where H E and E[H E ] are the observed and expected number of 
heterozygotes genotypes, respectively, and 

E[H E ]=n2f(l-f). (6) 

Consider now a model in which the value of F may differ 
among individuals with individual i having inbreeding coefficient 
F h F = (Fi, F 2 , ... , F„), and assume that allele frequencies fi are 
available for k loci, f = (f lr f 2 , ■■■ , fid- Assuming independence 
among sites, the joint likelihood function for F and f is then given 

by 

log[p(G|f,F)] = ti [v„log[(l -f ( ) 2 + (1 -fdff] 
+ /ilog[2(l-f ( )f ,(!-*■)] 

+ W°g[f? + (W/Wd]- (7) 

Here I pq il is an indicator function, which is equal to one if 
the genotype of individual / in locus / is equal to pq. This likeli- 
hood function has no simple solution and must be optimized 
numerically. 

For this likelihood function, even in the simple case of a single 
site and a shared value of F, estimation requires the availability of 
known genotypes for each individual. This is a challenge in the 
analysis of NGS data, because the value of F in itself is important for 
genotype calling. To address this issue, we developed two algo- 
rithms for estimating inbreeding coefficients, both per individual 
(find) and per site (F site ), from NGS data under a probabilistic 
framework based directly on genotype likelihoods. These estimates 
can then be incorporated into the genotype-calling algorithm to 
provide improved calculations of genotype posterior probabilities. 
We demonstrate the accuracy of our method using simulation and 



show that the new method leads to increased accuracy in genotype 
calling and estimation of the site frequency spectrum (SFS). Finally, 
we apply our method to a previously published rice data set (Xu et al. 
2011) and show marked improvements over previous methods. 

Results 

Estimating per-site inbreeding coefficients from simulated data 

During standard NGS data analyses, one of the most crucial steps is 
quality control. Several different filters are usually applied to ex- 
clude anomalous sites, using base quality bias, strand quality bias, 
extremely high/low sequencing coverage, or deviations from HWE 
(Xia et al. 2009; The 1000 Genomes Project Consortium 2010; Xu 
et al. 2011). To test for deviations from HWE, the expected geno- 
type frequencies under HWE (calculated using the observed allele 
frequencies) are compared with the observed frequencies through 
a x 2 or Fisher's exact test. However, somewhat inconveniently, 
these tests can only be done after genotypes have been called. 
Here, we suggest a new method to jointly estimate, per site, both 
MAF and inbreeding coefficients (F sitl ), using an expectation- 
maximization (EM) algorithm (Ceppellini et al. 1955; Smith and 
Thomson 1988). This method forms the basis for a likelihood ratio 
test of HWE (H Q : F = 0) that can be applied to filter sites before 
genotypes have been called. 

To assess the accuracy of our method, we applied it to a sim- 
ulated data set of 10,000 variable sites under several different pa- 
rameter combinations. For each of the 495 combinations of pa- 
rameters, we estimated the inbreeding coefficients per site (F slte ) 
and plotted them together with their associated root mean square 
deviation (RMSD). Our results show that this method has reason- 
ably good accuracy in estimating inbreeding coefficients per site 
with sequencing coverage >3 x , sample sizes of 30 individuals, and 
an error rate of 0.5% (Fig. 1, right column). However, not surpris- 
ingly, high error rates, low coverage, and small sample sizes will 
result in reduced accuracy compared with estimates based on full 
knowledge of the genotypes (Supplemental Fig. 1). As typical for 
a bounded parameter, for small sample sizes the estimator becomes 
heavily biased when the true value is close to the boundary of the 
parameter space. 

Estimating individual inbreeding coefficients from simulated 
data 

Although inbreeding coefficients per site can be useful for quality 
control (filtering sites that depart from HWE), a more interesting 
and biologically meaningful parameter is the inbreeding co- 
efficient per individual. Estimates of this parameter can shed light 
into the species' mating system and past history (domestication), 
as well as be used as a prior to improve genotype-calling algo- 
rithms. To this end, we extended a recently published algorithm by 
Hall et al. (2012) to estimate per-individual inbreeding coefficients 
directly from genotype likelihoods. 

To assess the accuracy of this method, we applied it to the 
same simulated data set as in the previous section. For each of 
the 495 combination of parameters, we estimated inbreeding co- 
efficients per individual (F ind ) and plotted them together with their 
associated RMSD. In all surveyed scenarios, the method presented 
here largely outperformed the original one, with lower RMSD and 
estimates closer to the true value (Fig. 1, left and center columns). This 
trend is even clearer in cases of extremely low coverage (lx), small 
sample sizes (10 individuals), and high error rates (Supplemental 
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Figure 1. Estimation of inbreeding coefficients. Performance of the EM method to infer F ind from called genotypes (left column), F ind from genotype 
likelihoods (center column), and F site (right column), for a sample size of 1 0 (first row) and 30 (second row) individuals and 1 0,000 variable sites simulated 
with a 0.5% error rate. Line styles and symbols represent different simulated sequencing coverages. Filled lines represent the inferred value for each 
simulated scenario (Infer. F), while dotted lines represent its RMSD. 



Figs. 2, 3). As an example, in a 1 x data set with an average error rate 
of 0.5% and a sample size of 10 individuals, we obtain very accu- 
rate estimates, with the RMSD always smaller than 0.085, while the 
original method, applied to called genotypes, resulted in RMSDs as 
high as 0.41. 

In these simulations, we assumed all sites to be independent. 
However, in real data, loci are linked, resulting in a lower number 
of available independent loci. In a partially selfing population, 
where S is the proportion of selfing, the effective population size is 
reduced by a factor of 1 - S/2 and the effective recombination rate 
is reduced by a factor of y^jp (Golding and Strobeck 1980). As an 
example, with a selfing rate of S = 2/3, the effective recombination 
rate is reduced by a factor of 2, effectively reducing the number of 
independent loci. To assess the impact of a reduced number of 
effective sites on our estimates, we repeated the same simulations 
using half the effective number of independent variable sites (5000) 
and obtained similar results (Supplemental Fig. 4). Furthermore, 
and to fully address the impact of non-independence of sites, we 
simulated a more realistic 5-Mb genomic sequence, using as pa- 
rameters previous estimates for rice populations and two realistic 
self-pollinating rates (S e {0.7, 0.95|) (for details, see Methods). If all 
inbreeding is due to selfing, these rates correspond to theoretical 
inbreeding coefficient values of 0.54 and 0.90 (F=^) (Haldane 
1924). Using our method, we obtained relatively accurate estimates 
of 0.64 and 0.84, respectively, demonstrating the robustness of the 
presented method even in the presence of linked sites. We notice 
that when sites are not independent the ML estimator is not truly 
a ML estimator and, therefore, should be considered a composite 
likelihood estimator. To form a proper ML inference procedure, data 
can be filtered to remove linked sites, but such filtering will lead to 
a loss of information. 

This method turned out to be quite slow (on average 3.5 min 
and 147 iterations) and led us to develop a faster approximate 



algorithm that can be used for the initial iterations of the algo- 
rithm, greatly speeding up the analysis when analyzing large data 
sets (see Methods). 

Effect of inbreeding on genotype calling 

Several factors can bias genotype calling, including high error 
rates, inbreeding, sequencing coverage, and small sample sizes. To 
assess the impact of inbreeding on genotype call performance, we 
used the previously mentioned simulated data to call genotypes 
using a Bayesian approach under two different priors for the ge- 
notype frequencies: random mating (HWE; F = 0) and inferred 
inbreeding coefficient. 

Assuming random mating in the prior yields constant 
genotype-calling error rates, independently of the sample in- 
breeding levels. When all sites are considered, proportions of 
miscalled genotypes are between 0.1 and 0.25, being unequally 
distributed between heterozygotes and homozygotes: 0.3-0.55 and 
0.1-0.25, respectively (Fig. 2). However, in highly inbred samples, 
being able to incorporate inbreeding in the prior can greatly reduce 
genotype-calling errors, often to less than half of when assuming 
HWE (Fig. 2, left column). Considering homozygous and hetero- 
zygous genotypes separately provides additional insight into the 
effect of the priors. When assuming inbreeding, heterozygous ge- 
notype calling performs slightly worse (—30%) since the prior as- 
signs a lower probability on heterozygote genotypes (Fig. 2, center 
column). However, this increase in the heterozygous genotype- 
calling error rate is offset by the improvement in homozygous 
genotype calling, here assuming that inbreeding greatly reduces 
this error by as much as 60% (Fig. 2, right column). This level of 
improvement can be very important if we consider that highly 
inbred samples are almost exclusively homozygous (see also Sup- 
plemental Figs. 6-8). 
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Figure 2. Effect of the inbreeding coefficient on genotype calling. Performance of genotype calling globally (left column), on just heterozygous 
genotypes (center column) and just homozygous genotypes (right column), on a sample size of 10 (first row) and 30 (second row) individuals and 1 0,000 
variable sites simulated with a 0.5% error rate. Line styles and symbols represent different simulated sequencing coverages. Line types represent the level of 
inbreeding assumed in the priors: F=0 (HWE; filled) and inferred value of F (Infer. F; large dashes). Missing F= 1 values reflect the absence of heterozygous 
genotypes on a totally inbred sample. 



Effect of inbreeding on SFS 

Allele frequencies and their distribution are important summaries 
of population genetic data analyses (Li 2011). Many widely used 
statistics such as Tajima's D, Fu and Li's D, Fay and Wu's H, or F ST 
(Nielsen 2005; Holsinger and Weir 2009) are direct functions of the 
SFS. These statistics can be used to infer demographic histories and 
to quantify the effect of natural selection. Given their importance 
in population genetic studies, it is of great interest to be able to 
estimate them reliably. 

To assess the magnitude of inbreeding-related errors associ- 
ated with SFS estimation, we inferred the SFS on the same simu- 
lated data set and under the same priors for the genotype fre- 
quencies as before: random mating (HWE; F = 0) and inferred 
inbreeding coefficient. We used both the standard approach based 
on called genotypes (see Methods) and a recent probabilistic 
method by Nielsen et al. (2012). High inbreeding coefficients have 
a marked effect on SFS estimation, and can increase the RMSD in 
the estimate of the SFS many fold (Fig. 3; Supplemental Figs. 9, 10). 
The inclusion of a correct prior will eliminate this problem, pro- 
viding estimates of the SFS that are as good, or better, than the 
estimates obtained in the presence of no inbreeding. Not surpris- 
ingly, the probabilistic method performs overall better than using 
called genotypes. However, the difference between using called 
genotypes and the probabilistic approach is much smaller here 
than observed in other studies (Kim et al. 201 1; Nielsen et al. 2012), 
because only true SNPs are included in these simulations, allevi- 
ating the problem of an excess of false singletons (and to a lesser 
degree doubletons) in methods based on genotype calling. 

Application to real data 

To illustrate the relevance of our method, we applied it to a publicly 
available data set of both wild and domesticated (cultivated) rice 



accessions (Xu et al. 2011). Cultivated rice (Oryza sativa) is classi- 
fied into two major subspecies (O. s. japonica and O. s. indica) and 
further subdivided into genetically differentiated groups. There are 




Figure 3. Effect of the inbreeding coefficient on SFS estimation. Per- 
formance of SFS estimation from called genotypes (left column) and the 
Nielsen et al. (2012) method from GL and assuming inbreeding (right 
column), on a sample size of 1 0 (first row) and 30 (second row) individuals 
and 1 0,000 variable sites simulated with a 0.5% error rate. Line styles and 
symbols represent different simulated sequencing coverages. Line types 
depict the level of inbreeding assumed in the priors: F= 0 (HWE; filled) or 
inferred value of F (Infer. F; large dashes). 



Genome Research 1855 



www.genome.org 



Vieira et al. 



also several species of wild rice, with the Oryza rufipogon species 
complex thought to be the closest to domesticated rice (e.g., Grillo 
et al. 2009; Wei et al. 2012). This species complex includes two forms: 
one perennial, photoperiod sensitive, and partially cross-fertilized 
(O. rufipogon); and another annual, photoperiod insensitive, and 
predominantly self-fertilized (Oryza nivara). The phenotypic dif- 
ferences between them have spurred a longstanding debate over 
the origins of cultivated rice, with some works assuming them to 
be different species (Sang and Ge 2007; Grillo et al. 2009), while 
others consider them as just ecotypes of a single species (Oka 1988; 
Zhu et al. 2007; Huang et al. 2012a; Wei et al. 2012). 

The diversity of mating systems, as well as the presence of 
both domesticated and wild forms, makes rice an interesting sys- 
tem for which to validate our newly developed methods. Among 
wild accessions the self-crossing rate is quite variable, although 
O. rufipogon tends to have lower rates than O. nivara: 50%-80% and 
75%-95%, respectively (Morishima et al. 1984; Oka 1988; Gao et al. 
2002; Phan et al. 2012). As for the cultivated accessions, they are 
thought to be almost totally inbred with self-crossing rates close to 
95%, although O. s. indica has been described as having slightly 
lower rates (Oka 1988). Using our method, we aimed to estimate 
per-individual inbreeding coefficients of all studied 65 rice acces- 
sions. Since the level of population structure is not clear for these 
species, we analyzed each one of them separately. Our estimates 
show O. rufipogon with an intermediate level of inbreeding (F ind ~ 
0.35), while Oryza nivara, O. s. indica, and O. s. japonica present 
significantly higher values around 0.6, 0.52, and 0.6, respectively 
(Fig. 4; Supplemental Table 1). 

To assess the impact of explicitly assuming inbreeding on SFS 
estimation, we estimated it for each of the four rice species/ 
subspecies. We used two different priors (random mating and esti- 
mated inbreeding coefficients) over two different methods (the 
probabilistic method by Nielsen et al. 2012 and using calling ge- 
notypes). Figure 5 shows that even for high coverage data (~10x) 
(O. s. indica and O. s. japonica in Fig. 5), methods assuming HWE 
have an excess of singletons compared with methods that take 
inbreeding into account. This is a result of the greater weight the 
HWE prior gives to heterozygous genotypes, and the effect is 
stronger for genotype-calling methods than for the probabilistic 
method providing direct estimates of the SFS. In the data sets 
that also include low coverage samples (<5x) (Fig. 5, top row), the 
probabilistic method gives similar results irrespective of the prior 
used. However, the genotype-calling method, particularly assuming 
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Figure 4. Boxplot analysis of inferred per-individual inbreeding esti- 
mates. Each population was analyzed independently and the inferred 
inbreeding coefficients plotted. 



Figure S. Estimated SFS on the analyzed rice population. SFS was es- 
timated on the four populations using called genotypes (CG) and the 
Nielsen et al. (2012) method (SFS). In both cases, two priors were used: 
random mating (HWE) and inferred per-individual inbreeding estimates (F). 

HWE, estimates many more singletons than other methods. How- 
ever, both data sets contain high (10X) and low (2x-3x) coverage 
samples. To make sure the observed SFS differences were not caused 
by the presence of high-coverage accessions in the sample, we re- 
peated the analysis on just the 10 low-coverage O. rufipogon acces- 
sions and found a similar trend (Supplemental Fig. 11). All in all, 
these results illustrate the importance of taking inbreeding into 
account when estimating allele frequencies, particularly in methods 
based on genotype calling. 

Discussion 

While sequencing is becoming cheaper, there is an increasing de- 
mand for larger data sets, suggesting that low-coverage data will be 
common for years to come. When analyzing such data there can 
be considerable uncertainty, and inbreeding may, as illustrated by 
our results, have a marked effect on downstream analyses. Current 
NGS data analyses methods are mostly tuned for human pop- 
ulations and usually assume that the populations are in HWE. 
Although this is true for many species (e.g., human and mouse), 
there are self -pollinating plants (e.g., Arabidopsis) and domesti- 
cated species (e.g., rice, maize, dog), as well as species with asexual 
life cycles (e.g., daphnia, aphids, wasps) that are expected to have 
extremely high levels of inbreeding. Furthermore, many NGS data 
sets are being produced for domesticated species, due to their eco- 
nomic importance, and many of these species have significant 
amounts of inbreeding. It is therefore of great importance to include 
techniques for incorporating inbreeding when analyzing NGS data. 

In this study, we developed algorithms to deal with inbred 
NGS data, either by estimating inbreeding coefficient per site or per 
individual. The per-site algorithm is mainly aimed at NGS quality 
control by removing sites that deviate from HWE. Usually, these 
deviations are done by comparing the expected genotype fre- 
quencies under HWE with the observed ones through a x 2 or 
Fisher's exact test. However, in such analyses, genotypes need to be 
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called first, possibly introducing biases in the downstream analy- 
ses. Our approach forms the basis for a likelihood ratio test for 
deviations from HWE (H 0 : F = 0) that can directly test the sites 
before calling genotypes. 

Nevertheless, a more interesting and biologically meaningful 
parameter is the inbreeding coefficient per individual. This can 
shed light into the species' mating system and past history (do- 
mestication), as well as be used as a prior in genotype calling, SFS, 
or other algorithms. Several methods have been published to infer 
per-individual inbreeding coefficients (Vogl et al. 2002; Leutenegger 
et al. 2003; Wang et al. 2006; Moltke et al. 2011), but all were 
designed for genotype (marker) data. Although all present slight 
improvements, Hall et al. (2012) recently incorporated most fea- 
tures into a single EM algorithm and showed that it outperformed 
previous methods. Here, we have modified this algorithm to ac- 
commodate for NGS data, as well as an approximate EM algorithm 
that can help speed up convergence. We notice that the rate of 
convergence can be further increased by using an accelerated EM 
approximation (Jamshidian and Jennrich 1993), although such an 
approach was not pursued here since we considered the running 
times to be acceptable (Supplemental Table 2). 

In all scenarios examined, the new method presented here 
largely outperformed the original Hall et al. (2012) method based 
on called genotypes, especially in cases of extremely low coverage, 
small sample sizes, and high error rates. Because the original 
method has been previously shown to outperform other methods 
based directly on genotypes (Hall et al. 2012), the advantage of our 
method, in the presence of genotype uncertainty, should extend to 
these methods as well. Our analyses of simulated data further show 
that failing to use a correct prior can greatly affect downstream 
analyses. Genotype-calling errors can be more than twofold reduced 
by incorporating inbreeding into the genotype-calling algorithm, 
and there is an even more marked effect on the estimation of the 
SFS. Here, genotype-calling methods combined with erroneous 
assumptions of HWE when analyzing data from highly inbred 
species can lead to severe biases. Our real data analysis further 
supported these results. 

We note that this manuscript distinguishes between inbreeding 
per site and per individual, with the main algorithm focusing on 
individual inbreeding coefficients and their application in geno- 
type calling. The estimated inbreeding coefficient is a probability 
of identity by descent and is a property of an individual, implicitly 
assumed to be caused by cycles in the pedigree. As such, we do not 
attempt to assign particular individual segments as identical by 
descent (IBD) for genotype calling. Nevertheless, we note that the 
inference of individual IBD tracts, using hidden Markov model 
(HMM) style approaches, might improve both inferences regarding 
IBD and genotype calling. However, the implementations of such 
methods are computationally challenging, particularly because LD 
may strongly affect inferences regarding local IBD tracts (e.g., 
Moltke et al. 2011). 

As a final remark, although our lower tested coverage was 1 x, 
we expect our algorithm to perform equally well at ultra low cov- 
erages (e.g., O.lx or 0.5 x), given that enough variable sites with at 
least two sampled reads from the same individual are available (as 
a rule of thumb, at least around 1000). 

Methods 

Throughout this work we use the following notation: 

• n = number of individuals 

• k = number of loci 



• X a = read data for individual i at locus / 

• Gu = genotype of individual i at locus I (member of Z) 

• Z = [AA, Aa, aa\ or {0, 1, 2} 

• fi = allele frequency at locus / 

' fpq - frequency of genotype pq 

• Fi = inbreeding coefficient for individual i 

Furthermore, vectors and matrices are depicted in bold (e.g., 
F or X), while scalars are not. Parameter estimates are depicted with 
a hat (e.g., F), while intermediate iteration EM estimates are de- 
picted with a tilde (e.g., F ). When discussing methods for a single 
site, we drop the indicator for the identity of the site in the notation. 



EM algorithm for per-site inbreeding estimation 

For per-site inbreeding coefficients, the likelihood function, based 
on genotype likelihoods, is defined as 

i=l i=l Gez 

where p(X,|G) is the genotype likelihood and p(G\f, F) its prior 
(Eq. 2). An ML algorithm for maximizing this function is obtained 
by replacing the observed genotype counts in Equation 3 with the 
posterior expectation for genotype counts. To maximize the like- 
lihood function, we use an EM algorithm to, iteratively, improve 
estimates of f and F- Using p(G t = g\Xi) as a shorthand notation 
for p(Gi=g\Xi,f .F 1 ), the posterior probability of genotype g in 
individual ;': 



£ [p(G i = l\X i ) + 2p(G i = 2\X i ) 
' In 



F i+1 = 1 



2>(G, = 1|X,) 

i=l 

E[H E ] 



(9) 



(10) 



where E[H E ] is calculated as in Equation 6 replacing f with f . The 
posterior at the /th step of the iteration can be calculated as 

m-m- ^-^-fi^ . (id 

I p(X t \G)p(G\f',F>) 

A likelihood ratio can then be constructed by comparison of 
the likelihood function evaluated at the ML estimate of F and f 
to the likelihood assuming F = 0 to form a likelihood ratio test of 
the HWE. 



EM algorithm for per-individual inbreeding estimation 

There is little reason to assume that all individuals are equally in- 
bred. On the contrary, when averaged over many individuals, we 
would expect the same inbreeding coefficient in each site if there 
has been no natural selection for or against inbreeding. In ad- 
dition, inbreeding estimates based on individuals sites are likely 
to have large associated variances. For these reasons, priors for 
genotype calling are more conveniently based on inbreeding 
estimates that are allowed to vary among individuals, but not 
among sites. The following sections are devoted to describing such 
methods. 

Assuming independence among sites, the expectation of the 
log likelihood under this model is obtained by replacing the 
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indicator functions in Equation 7 with the posterior probability 
of the genotype. Using p(G u = g\X n ) as a shorthand notation for 
p(Gu = g\X ih f,,F i ): 

£[log(p(X|f, F))] = t i [p(G u = 0|X B )log[(l - f,f + (1 - f,)f,F,] 

+ p(G„ = l|X fl )log[2(l -f,)f,(l -Ft)} 

+ p(G a = 2|X„)log[/f + (1 - f,)f,F,]\ (12) 



Hall et al. (2012) have recently proposed an EM algorithm 
to estimate per-individual inbreeding coefficients from genotype 
data. To maximize Equation 12, we extend their method for the use 
of genotype likelihoods (instead of known genotypes). Adapting 
Equation 1 1 from their paper to account for genotype uncertainty, 
for an individual i: 

F / +1 =\i P(IBDi,\X tt ) = \ t £ \p{IBDl a \G)p{G\X a )\ (13) 
K ;=i K /=1 GeZ 1 J 

where p^IBD'^Xu) is the posterior probability that the two alleles at 
locus / are identical by descent (IBD) at iteration ;'. This can be 
calculated using p(G\Xu), the genotype posterior probability, and 
p(ffi£> a |G)as 



p(IBD> u \G) = 



p(G\IBD' n )p(IBD' n ) 

p{g\ibd^p{ibd^ 

+p(G\IBD' il )p(lBD i u ) 



(14) 



where p(IBD' u ) is the probability that two alleles at locus / are not 
IBD at iteration /'. In the end, Equation 13 results in 



F i+1 = l -i 



(\-f\)F>p(G H = Q\Xu) 

(i-fi)F>+(i-fi) 2 (i-p>) 

| f\F\p{G a = 2\X„) 



(15) 



leads to 



A similar extension to their update for allele frequencies (f\ + 1 ) 

t \p(Gu = 1 \X a ) + p(G u = 2\X U ) (2 - Pf) ' 



r 



i=l 



p(G n = 1 \X„) + p(G„ = 0\X U ) [2 - F'fj 
+p(G„ = l\X„)+p(G„ = 2\X a )(2-F\) 



(16) 



As pointed out by Hall et al. (2012), the EM algorithm can 
converge to a local rather than a global maximum (Wu 1983), 
and, for this reason, several different starting values should 
be used. Additionally, rather than using random values as initial 
values, Equation 5 can be used to obtain initial estimates off,, (Pj ) 
replacing observed genotype counts with their expected value. 

Approximated EM for per-individual inbreeding estimation 

The EM algorithm in Hall et al. (2012) is derived by treating the 
inbreeding status (inbred or not) in a single site as latent data. 
However, a faster algorithm can be derived by approximating an 
analytical solution to the maximization step for F t in Equation 12. 
This method is not guaranteed to converge to the global maxi- 
mum, but, since it initially converges considerably faster, it can be 
used in the initial iterations of the algorithm, greatly speeding up 
the previous method. 



For a particular individual, to maximize values of F,-, we find the 
partial derivative of Equation 12 in order with F t and set it equal to zero: 



diEpog(p(X|F,f)]] = £ 



dFi 



P(Gu 


= 0|X„)(1 


~f,)fi 




A) 2 +(i- 


fi)m 


p(G a = 


1\X U )2(1 


-m 


2(1 


-fdm- 


-Pt) 


p(G tt = 


2\X„)(1- 


Ml 



ff + (l-f,)f,Fi 



= 0. 



(17) 



Since this expression cannot be solved numerically, we ap- 
proximate it using an expansion around F t (current value off; in an 
iterative algorithm) to obtain an approximate expression that can 
be optimized analytically. Equation 17 is composed of functions of 
F of the form [a/(fi + Fc)], which can be expanded to 



b + Fc b + Fc (b + Fcf [ J 



(18) 



Ignoring terms of order (F - F) and higher, Equation 1 7 can 
then be rewritten as 



9/E[log(p(X|F,f))] 


k 

= I 


flo a 0 c 0 (Fi-Fi) 


dFi 


f=l 


bo + FiCo (bo+FiCof 






a\ a\C\(Fi - Fi) 






bi+Fid (bx+FiCxf 




+ 


a 2 azC2(Fi-Fi) 






b 2 +FiC 2 (b 2 +F i c 2 f 


where 






fl 0 = 


P(Gu 


= o|x 0 )(W,)f/ 


fli = 


P(G U 


= l|X 0 )2(l-f,)f, 


«2 = 


P(Gu 


= 2|X„)(l-f ( )f, 


b 0 = (l 


-fif 


co=(l-fi)f, 


h=2( 


l-f,)f, Cl = -2(1 -f,)f,. 


b 2 = ff 




C2={l-f,)f, 



0 (19) 



k 

X 

1=1 



F,= 



Solving for F, (for F, + \,fi+ 0): 



b a + F,c 0 (bo+FiCof 

fli ct\C\Fi 

-= — + . 

bi+Fid (bi+Fid)' 

a 2 a 2 c 2 Fi 
b 2 +F,c 2 (b 2 +FiC 2 )' 



k 

I 

1=1 



flO Co 



aid 



a 2 c 2 



(b 0 + FiCof (h+Fidf (b 2 + FiC 2 f 



(20) 



The algorithm then proceeds iteratively using Equation 19 
with F,=f' and F i =P' l + . As the algorithm proceeds, the differ- 
ence between F t and F ; decreases, providing a progressively more 
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accurate approximation. However, as the joint update of F and fis 
not a joint maximization of the same expected log likelihood, it 
may lead the algorithm to be stuck in saddle point. To ensure 
eventual convergence, we then revert to our extension of the Hall 
et al. (2012) method for the last iterations of the algorithm. 

Genotype calling 

To call genotypes, we use a Bayesian approach to integrate over 
several error sources including base quality score and mapping 
quality score. We use the genotype likelihood at each site / and for 
each individual i (Eq. 1), together with a prior, to calculate the 
posterior probability of the genotypes and call the genotype with 
the highest probability (Li 2011; Nielsen et al. 2011, 2012). As 
a prior we use either the expected genotype frequencies under (1) 
HWE or (2) HWE assuming the estimated inbreeding coefficients, 
using the MAF calculated according to Kim et al. (2011). 

Site frequency spectrum estimation 

Estimation of the SFS can be achieved in several ways. Standard 
SFS estimation methods rely on first calling genotypes and then 
calculating allele frequencies at each position, but this approach 
is prone to bias and can greatly influence the results, especially 
at low coverage (Johnson and Slatkin 2008). Here, we consider 
an extended version of the SFS (since we also consider sites in 
the alignment that are fixed) that avoids the genotype-calling 
step. Instead, this method bases its inferences on the posterior 
probability (calculated with a prior accounting for HWE de- 
viations) of the allele frequency for each site (Nielsen et al. 
2012). Correcting a typo in the Nielsen et al. (2012) section 
"Incorporating Deviations from Hardy- Weinberg Equilibrium" 
and suppressing the site index in the notation, their algorithm 
should be 

INITIALIZATION: 

Set h 0 =p(G 1 =0\X 1 ,f,F 1 ) 1 
h!=p(Gi = 
h 2 =p(G 1 =2\X 1 ,f,F 1 ) 
For; in 3,4, . . . ,2n : 
hj = 0 



RECURSION: 



For i in 2, 3, . . . ,n : 
For /in 2i,2/- 1,...,2 : 

Setfy=p(G,= 2\X 1 ,f,F i )h j . 2 
+p(G i = l\X i ,f,F i )h hl 
+p(G i = 0\X,,f,F i )h i 
Set h x = p(Gi = 0\X t , f,Fi)h! + p(G t = l|X f , f , P^ho 
Set h 0 =p(G i = 0\X i ,f,F i )h 0 

where p(G ; = g \ X h f, F ,) is the posterior probability for individual 
i and genotype g, using the ML estimates of f and Fj. For a global 
estimate of the SFS, we sum each category (hj) across all sites and 
condition the SFS to only include variable sites: 

SFS / = ^L,/e{l,2,3,...,2n-l}. 



NGS data simulation 

We performed extensive simulation studies to assess the perfor- 
mance of our methods and the effect of inbreeding on downstream 
analyses. Specifically, we assessed (1) the accuracy of the inbreeding 
coefficient estimates (both per site and per individual), (2) the im- 
pact of inbreeding on genotype calling, and (3) the influence of 
inbreeding in the estimation of the SFS. Due to computational 
constraints, we simulated mapped sequencing data rather than raw 
sequencing reads, similarly to previous studies (Kim et al. 2010, 
2011). Each individual genotype was simulated assuming di-allelic 
loci with a given MAF for each locus and inbreeding coefficient F. In 
each locus, the number of reads was drawn from a Poisson distri- 
bution with the mean equal to the specified individual sequencing 
coverage. To simulate errors, each read base was changed to any of 
the other nucleotides at an equal rate e/3, where £ is the error rate. 

We simulated 10,000 variable sites on 10, 30, and 50 individuals, 
over average sequencing coverages of 1, 2, 3, 5, and 10X, with error 
rates of 0.5%, 1%, and 2%, and varied inbreeding coefficients from 
0.0 to 1.0 in steps of 0.1, for a total of 495 combinations. With these 
parameter choices, we tried to focus on relatively extreme data sets 
(small sample sizes and low coverage), with realistic error rates (Glenn 
201 1) and covering biologically relevant scenarios of inbreeding from 
<0.07 in humans (Carothers et al. 2006) and —0.3 in dogs (Kirkness 
et al. 2003; Gray et al. 2009) to 0.4-0.98 in rice (Kovach et al. 2007) 
and 0.757 in wasps (Chapman and Stewart 1996). 

We also simulated an extra data set, for validation purposes, of 
1 million sites where only 1% are truly variable (true SNPs). We 
kept the same error rates, number of individuals, coverage, and 
inbreeding coefficients as before, for a total of 165 combinations of 
parameter values. Simulated data with only true SNPs, and with 
both true SNPs and invariable sites, yielded similar results (Sup- 
plemental Figs. 1-3, 5). For computational reasons, we therefore 
proceeded to use only the first data set in the rest of the analyses. 

To test our method under linked loci, we performed a couple 
more simulation analyses. First, we simulated half the previous 
number of variable sites (5000), under the same 495 parameter 
combinations as before. Second, we simulated a 5-Mb genomic 
region across 30 accessions from one rice population, using the 
software SFS_CODE (Hernandez 2008). We assumed an effective 
population size of 125,000 (Caicedo et al. 2007; Asano et al. 201 1), 
a mutation rate of 10~ 8 (Caicedo et al. 2007), a recombination rate 
of 4 cM/Mb (Tian et al. 2009; Asano et al. 2011), and two realistic 
self-pollinating rates of 0.7 and 0.95 (Oka 1988) ('-theta 0.005 -rho 
0.02 -self [0. 7,0.95] -sampSize 30'). We then used the program ART 
(Huang et al. 2012b) to simulate 2x coverage 100-bp mapped 
reads with no indels directly in SAM format ('-len 100 -fcov 2 -ir 
0 -dr 0 -ir2 0 -dr2 0 -qs 10 -qs2 10 -sam'). 

For the estimation of inbreeding coefficients (both from 
simulated and real data), we only use called SNPs with a log like- 
lihood ratio (LRT) >15.1366 ( x 2 ; P < 1 x 10~ 4 ; 1 d.f.), against the 
null hypothesis of ^=0, as implemented in the software ANGSD. 



Error estimates 

We calculated errors associated with the inbreeding coefficient es- 
timates (F), genotype calling, and SFS estimation. For inbreeding 
estimates and SFS estimation, we used the RMSD. More specifically: 



RMSD = 




where X true and X est are the true and estimated values of the pa- 
rameters, and S the total number of estimates. For estimates of f ind , 
S is the total number of individuals, for F site the effective number of 
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sites, and for the SFS the number of categories (S = 2n + 1). For 
genotype calling, the associated error was calculated as the pro- 
portion of miscalled genotypes. All plots were made using the 
R package ggplot2 (Wickham 2009). 

Analysis of real data 

In addition to simulated data, we also analyzed previously pub- 
lished Illumina GA II technology data from Rice, O. sativa (Xu et al. 
2011). These data consist of 40 domesticated rice accessions, rep- 
resentative of all major Asian rice groups (27 O. s. japonica and 13 
O. s. indica), together with five O. nivara and five O. mfipogon wild 
accessions at an effective (after mapping) sequencing coverage of 
lOx. The data set also includes an additional 15 wild rice acces- 
sions (10 O. mfipogon and five O. nivara) at an effective sequencing 
coverage of between 2x and 3x. 

We used the originally mapped reads but performed de novo 
quality controls using only sites with minimum root mean square 
(RMS) mapping quality >10, maximum P-value for (strand bias, 
base quality bias, map quality bias, end distance bias, and HWE 
excess of heterozygous exact test) >10~ 4 , and total coverage be- 
tween 57 x and 2645 x for 65 individuals, but where at least half 
the individuals had at least 2x coverage (Minoche et al. 2011). 
After filtering, we calculated the genotype likelihoods with the 
SAMtools program (Li et al. 2009b) and used them in all sub- 
sequent analyses. Again, we only used variable sites for the esti- 
mation of inbreeding coefficients. 

Software availability 

The methods presented in this work were implemented in C/C++ 
and are freely available for non-commercial use. The per-site in- 
breeding coefficient's (F stte ) estimation was incorporated into the 
software ANGSD, while the per-individual (F ind ) method was 
implemented in the stand-alone program ngsF. Both are available 
at http://cteg.berkeley.edu/~nielsen/resources/software/ or, in the 
case of ngsF, also at https://github.com/fgvieira/ngsF. 
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